{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dHxFf9hpq3pl"
   },
   "source": [
    "# Torch-Rechub Tutorial：Multi-Task\n",
    "- 场景：精排（多任务学习）\n",
    "\n",
    "- 模型：ESMM、MMOE\n",
    "\n",
    "- 数据：Ali-CCP数据集\n",
    "\n",
    "- 学习目标\n",
    "\n",
    "  - 学会使用torch-rechub训练一个ESMM模型\n",
    "  - 学会基于torch-rechub训练一个MMOE模型\n",
    "\n",
    "- 学习材料：\n",
    "\n",
    "  - 多任务模型介绍：https://datawhalechina.github.io/fun-rec/#/ch02/ch2.2/ch2.2.5/2.2.5.0\n",
    "\n",
    "  - Ali-CCP数据集官网：https://tianchi.aliyun.com/dataset/dataDetail?dataId=408\n",
    "\n",
    "- 注意事项：本教程模型部分的超参数并未调优，欢迎小伙伴在学完教程后参与调参和在全量数据上进行测评工作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jHcRLfMuN8z3"
   },
   "outputs": [],
   "source": [
    "#安装torch-rechub\n",
    "!pip install torch-rechub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vtczGZ_nvP69"
   },
   "source": [
    "## Ali-CCP数据集介绍\n",
    "- [原始数据](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408)：原始数据采集自手机淘宝移动客户端的推荐系统日志，一共有23个sparse特征，8个dense特征，包含“点击”、“购买”两个标签，各特征列的含义参考学习材料中的Ali-CCP数据集官网上的详细描述\n",
    "\n",
    "- [全量数据](https://cowtransfer.com/s/1903cab699fa49)：我们已经完成对原始数据集的处理，包括对sparse特征进行Lable Encode，dense特征采用归一化处理等。预处理脚本见torch-rechub/examples/ranking/data/ali-ccp/preprocess_ali_ccp.py\n",
    "\n",
    "- [采样数据](https://github.com/datawhalechina/torch-rechub/tree/main/examples/ranking/data/ali-ccp)：从全量数据集采样的小数据集，供大家调试代码和学习使用，因此本次教程使用采样数据\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 411
    },
    "id": "FI7Sb5_vvHgg",
    "outputId": "ca9a0975-21fc-4263-d3e5-4d2dbb8f4cee"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train : val : test = 100 50 50\n",
      "   click  purchase  101  121  122  124  125  126  127  128  ...  127_14  \\\n",
      "0      0         0    1    1    1    1    1    0    1    1  ...       1   \n",
      "1      0         0    1    1    1    1    1    0    1    1  ...       1   \n",
      "2      1         1    1    1    1    1    1    0    1    1  ...       1   \n",
      "3      0         0    1    1    1    1    1    0    1    1  ...       1   \n",
      "4      0         0    1    1    1    1    1    0    1    1  ...       1   \n",
      "\n",
      "   150_14  D109_14  D110_14  D127_14  D150_14     D508   D509    D702     D853  \n",
      "0       1   0.4734    0.562   0.0856   0.1902  0.07556  0.000  0.0000  0.00000  \n",
      "1       1   0.4734    0.562   0.0856   0.1902  0.00000  0.000  0.0000  0.00000  \n",
      "2       1   0.4734    0.562   0.0856   0.1902  0.56050  0.256  0.4626  0.34400  \n",
      "3       1   0.4734    0.562   0.0856   0.1902  0.26150  0.000  0.0000  0.12213  \n",
      "4       1   0.4734    0.562   0.0856   0.1902  0.35910  0.000  0.0000  0.00000  \n",
      "\n",
      "[5 rows x 33 columns]\n"
     ]
    }
   ],
   "source": [
    "#使用pandas加载数据\n",
    "import pandas as pd\n",
    "data_path = '../examples/ranking/data/ali-ccp' #数据存放文件夹\n",
    "df_train = pd.read_csv(data_path + '/ali_ccp_train_sample.csv') #加载训练集\n",
    "df_val = pd.read_csv(data_path + '/ali_ccp_val_sample.csv') #加载验证集\n",
    "df_test = pd.read_csv(data_path + '/ali_ccp_test_sample.csv') #加载测试集\n",
    "print(\"train : val : test = %d %d %d\" % (len(df_train), len(df_val), len(df_test)))\n",
    "#查看数据，其中'click'、'purchase'为标签列，'D'开头为dense特征列，其余为sparse特征，各特征列的含义参考官网描述\n",
    "print(df_train.head(5)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9DSxOha57FKh"
   },
   "source": [
    "### 使用torch-rechub训练ESMM模型\n",
    "\n",
    "#### 数据预处理\n",
    "在数据预处理过程通常需要:\n",
    "- 对稀疏分类特征进行Lable Encode\n",
    "- 对于数值特征进行分桶或者归一化\n",
    "\n",
    "由于本教程中的采样数据以及全量数据已经进行预处理，因此加载数据集可以直接使用。\n",
    "\n",
    "本次的多任务模型的任务是预测点击和购买标签，是推荐系统中典型的CTR和CVR预测任务。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "PtJQk9zBMa0H"
   },
   "outputs": [],
   "source": [
    "train_idx, val_idx = df_train.shape[0], df_train.shape[0] + df_val.shape[0]\n",
    "data = pd.concat([df_train, df_val, df_test], axis=0)\n",
    "#task 1 (as cvr): main task, purchase prediction\n",
    "#task 2(as ctr): auxiliary task, click prediction\n",
    "data.rename(columns={'purchase': 'cvr_label', 'click': 'ctr_label'}, inplace=True)\n",
    "data[\"ctcvr_label\"] = data['cvr_label'] * data['ctr_label']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VM3jT16DFc3M"
   },
   "source": [
    "#### 定义模型\n",
    "定义一个模型需要指定模型结构参数,需要哪些参数可查看对应模型的定义部分。 \n",
    "对于ESMM而言，主要参数如下：\n",
    "\n",
    "- user_features指用户侧的特征，只能传入sparse类型（论文中需要分别对user和item侧的特征进行sum_pooling操作）\n",
    "- item_features指用item侧的特征，只能传入sparse类型\n",
    "- cvr_params指定CVR Tower中MLP层的参数\n",
    "- ctr_params指定CTR Tower中MLP层的参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Fawews-U2ZBT",
    "outputId": "f10931f1-04d7-448e-e6df-cc5978f68e81"
   },
   "outputs": [],
   "source": [
    "from torch_rechub.models.multi_task import ESMM\n",
    "from torch_rechub.basic.features import DenseFeature, SparseFeature\n",
    "\n",
    "col_names = data.columns.values.tolist()\n",
    "dense_cols = ['D109_14', 'D110_14', 'D127_14', 'D150_14', 'D508', 'D509', 'D702', 'D853']\n",
    "sparse_cols = [col for col in col_names if col not in dense_cols and col not in ['cvr_label', 'ctr_label', 'ctcvr_label']]\n",
    "print(\"sparse cols:%d dense cols:%d\" % (len(sparse_cols), len(dense_cols)))\n",
    "label_cols = ['cvr_label', 'ctr_label', \"ctcvr_label\"]  #the order of 3 labels must fixed as this\n",
    "used_cols = sparse_cols #ESMM only for sparse features in origin paper\n",
    "item_cols = ['129', '205', '206', '207', '210', '216']  #assumption features split for user and item\n",
    "user_cols = [col for col in used_cols if col not in item_cols]\n",
    "user_features = [SparseFeature(col, data[col].max() + 1, embed_dim=16) for col in user_cols]\n",
    "item_features = [SparseFeature(col, data[col].max() + 1, embed_dim=16) for col in item_cols]\n",
    "\n",
    "model = ESMM(user_features, item_features, cvr_params={\"dims\": [16, 8]}, ctr_params={\"dims\": [16, 8]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f-eMJYhhOQLA"
   },
   "source": [
    "#### 构建dataloader\n",
    "\n",
    "构建dataloader通常由\n",
    "1. 构建输入字典（字典的键为定义模型时采用的特征名，值为对应特征的数据）\n",
    "2. 通过字典构建相应的dataset和dataloader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4qtVwFdc4p64",
    "outputId": "4eae45de-3a5d-4a53-d9e6-0715543edae4"
   },
   "outputs": [],
   "source": [
    "from torch_rechub.utils.data import DataGenerator\n",
    "\n",
    "x_train, y_train = {name: data[name].values[:train_idx] for name in used_cols}, data[label_cols].values[:train_idx]\n",
    "x_val, y_val = {name: data[name].values[train_idx:val_idx] for name in used_cols}, data[label_cols].values[train_idx:val_idx]\n",
    "x_test, y_test = {name: data[name].values[val_idx:] for name in used_cols}, data[label_cols].values[val_idx:]\n",
    "dg = DataGenerator(x_train, y_train)\n",
    "train_dataloader, val_dataloader, test_dataloader = dg.generate_dataloader(x_val=x_val, y_val=y_val, \n",
    "                                      x_test=x_test, y_test=y_test, batch_size=1024)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KvweU0m_RIFg"
   },
   "source": [
    "#### 训练模型及测试\n",
    "\n",
    "- 训练模型通过相应的trainer进行，对于多任务的MTLTrainer需要设置任务的类型、优化器的超参数和优化策略等。\n",
    "\n",
    "- 完成模型训练后对测试集进行测试\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "6WM8FiKbQcm2",
    "outputId": "823e8a8d-4e21-4b52-8885-ecfea5906355"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "train: 100%|██████████| 1/1 [00:11<00:00, 11.81s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train loss:  {'task_0:': 0.8062036633491516, 'task_1:': 0.7897741794586182}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "validation: 100%|██████████| 1/1 [00:08<00:00,  8.44s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 0 validation scores:  [0.9183673469387755, 0.553191489361702]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "validation: 100%|██████████| 1/1 [00:07<00:00,  7.99s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test auc: [0.4693877551020408, 0.8333333333333333]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import os\n",
    "from torch_rechub.trainers import MTLTrainer\n",
    "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
    "learning_rate = 1e-3\n",
    "epoch = 1 #10\n",
    "weight_decay = 1e-5\n",
    "save_dir = '../examples/ranking/data/ali-ccp/saved'\n",
    "if not os.path.exists(save_dir):\n",
    "    os.makedirs(save_dir)\n",
    "task_types = [\"classification\", \"classification\", \"classification\"] #CTR与CVR均为二分类任务\n",
    "mtl_trainer = MTLTrainer(model, task_types=task_types, \n",
    "              optimizer_params={\"lr\": learning_rate, \"weight_decay\": weight_decay}, \n",
    "              n_epoch=epoch, earlystop_patience=1, device=device, model_path=save_dir)\n",
    "mtl_trainer.fit(train_dataloader, val_dataloader)\n",
    "auc = mtl_trainer.evaluate(mtl_trainer.model, test_dataloader)\n",
    "print(f'test auc: {auc}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eEo3cm-JTqEn"
   },
   "source": [
    "### 使用torch-rechub训练MMOE模型\n",
    "训练MMOE模型的流程与ESMM模型十分相似\n",
    "\n",
    "需要注意的是MMOE模型同时支持dense和sparse特征作为输入,以及支持分类和回归任务混合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "KwvxQpAwSEak",
    "outputId": "d94c538f-f5a6-42fe-8fc2-525b2bf43700"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "train: 100%|██████████| 1/1 [00:07<00:00,  7.90s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train loss:  {'task_0:': 0.732882022857666, 'task_1:': 0.6457288861274719}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "validation: 100%|██████████| 1/1 [00:07<00:00,  7.84s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 0 validation scores:  [0.9591836734693877, 0.5354609929078014]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "validation: 100%|██████████| 1/1 [00:07<00:00,  7.62s/it]\n"
     ]
    }
   ],
   "source": [
    "from torch_rechub.models.multi_task import MMOE\n",
    "# 定义模型\n",
    "used_cols = sparse_cols + dense_cols\n",
    "features = [SparseFeature(col, data[col].max()+1, embed_dim=4)for col in sparse_cols] \\\n",
    "                   + [DenseFeature(col) for col in dense_cols]\n",
    "model = MMOE(features, task_types, 8, expert_params={\"dims\": [16]}, tower_params_list=[{\"dims\": [8]}, {\"dims\": [8]}])\n",
    "#构建dataloader\n",
    "label_cols = ['cvr_label', 'ctr_label']\n",
    "x_train, y_train = {name: data[name].values[:train_idx] for name in used_cols}, data[label_cols].values[:train_idx]\n",
    "x_val, y_val = {name: data[name].values[train_idx:val_idx] for name in used_cols}, data[label_cols].values[train_idx:val_idx]\n",
    "x_test, y_test = {name: data[name].values[val_idx:] for name in used_cols}, data[label_cols].values[val_idx:]\n",
    "dg = DataGenerator(x_train, y_train)\n",
    "train_dataloader, val_dataloader, test_dataloader = dg.generate_dataloader(x_val=x_val, y_val=y_val, \n",
    "                                      x_test=x_test, y_test=y_test, batch_size=1024)\n",
    "#训练模型及评估\n",
    "mtl_trainer = MTLTrainer(model, task_types=task_types, optimizer_params={\"lr\": learning_rate, \"weight_decay\": weight_decay}, n_epoch=epoch, earlystop_patience=30, device=device, model_path=save_dir)\n",
    "mtl_trainer.fit(train_dataloader, val_dataloader)\n",
    "auc = mtl_trainer.evaluate(mtl_trainer.model, test_dataloader)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Torch-Rechub Tutorial：Multi-Task.ipynb",
   "provenance": []
  },
  "interpreter": {
   "hash": "2f0699014af7f4c9080a159fe6ab9f0087a283cb8192b31d41a414a088fd29ff"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
